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Abstract 

Background: The phylogenetic Mean Pairwise Distance (MPD) is one of the most popular measures for computing 
the phylogenetic distance between a given group of species. More specifically, for a phylogenetic tree T and for a set 
of species R represented by a subset of the leaf nodes of T, the MPD of/? is equal to the average cost of all possible 
simple paths in Tthat connect pairs of nodes in R. 

Among other phylogenetic measures, the MPD is used as a tool for deciding if the species of a given group/? are closely 
related. To do this, it is important to compute not only the value of the MPD for this group but also the expectation, 
the variance, and the skewness of this metric. Although efficient algorithms have been developed for computing the 
expectation and the variance the MPD, there has been no approach so far for computing the skewness of this measure. 

Results: In the present work we describe how to compute the skewness of the MPD on a tree T optimally, in B(n) 
time; here n is the size of the tree T. So far this is the first result that leads to an exact, let alone efficient, computation 
of the skewness for any popular phylogenetic distance measure. Moreover, we show how we can compute in B(n) 
time several interesting quantities in T, that can be possibly used as building blocks for computing efficiently the 
skewness of other phylogenetic measures. 

Conclusions: The optimal computation of the skewness of the MPD that is outlined in this work provides one more 
tool for studying the phylogenetic relatedness of species in large phylogenetic trees. Until now this has been infeasible, 
given that traditional techniques for computing the skewness are inefficient and based on inexact resampling. 

Keywords: Algorithms for phylogenetic trees, Mean pairwise distance, Skewness 



Background 

Communities of co-occuring species may be described as 
"clustered" if species in the community tend to be close 
phylogenetic relatives of one another, or "overdispersed" 
if they are distant relatives [1]. To define these terms we 
need a function that measures the phylogenetic related- 
ness of a set of species, and also a point of reference for 
how this function should behave in the absence of ecolog- 
ical and evolutionary processes. One such function is the 
mean pairwise distance (MPD); given a phylogenetic tree 
T and a subset of species R that are represented by leaf 
nodes of T, the MPD of the species in R is equal to aver- 
age cost of all possible simple paths that connect pairs of 
nodes in R. 
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To decide if the value of the MPD for a specific set of 
species R is large or small, we need to know the average 
value (expectation) of the MPD for all sets of species in 
T that consist of exactly r = \R\ species. To judge how 
much larger or smaller is this value from the average, we 
also need to know the standard deviation of the MPD for 
all possible sets of r species in T. Putting all these values 
together, we get the following index that expresses how 
clustered are the species in R [1]: 

NRI = MPD(r?jR) ~ expec MPD (T,r) 
s^mpd (T,r) 

where MPD(T,#) is the value of the MPD for R in T, 
and expec(T) and s<3?mpd(T, r) are the expected value and 
the standard deviation respectively of the MPD calculated 
over all subsets of r species in T. 

In a previous paper we presented optimal algorithms for 
computing the expectation and the standard deviation of 
the MPD of a phylogenetic tree T in 0(n) time, where n 
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is the number of the edges of T [2]. This enabled exact 
computations of these statistical moments of the MPD on 
large trees, which were previously infeasible using tradi- 
tional slow and inexact resampling techniques. However, 
an important problem remained unsolved; quantifying 
our degree of confidence that the NRI value observed in 
a community reflects non-random ecological and evolu- 
tionary processes. 

This degree of confidence can be expressed as a statisti- 
cal P value, that is the probability that we would observe 
an NRI value as extreme or more so if the community 
was randomly assembled. Traditionally, estimating P is 
accomplished by ranking the observed MPD against the 
distribution of randomized MPD values [3]. If the MPD 
falls far enough into one of the tails of the distribution 
(generally below the 2.5 percentile or above the 97.5 per- 
centile, yielding P < 0.05), the community is said to 
be significantly overdispersed or significantly clustered. 
However, this approach relies on sampling a large number 
of random subsets of species in T , and recomputing the 
MPD for each random subset. Therefore, this method is 
slow and imprecise. This problem is exacerbated when it 
is necessary to consider multiple trees at once, arising for 
example from a Bayesian posterior sample of trees [4,5]. 
In such cases, sufficient resampling from all trees in the 
sample can be computationally limiting. 

We can approximate the P value of an observed NRI by 
assuming a particular distribution of the possible MPD 
values and evaluating its cumulative distribution func- 
tion at the observed MPD. Because the NRI measures 
the difference between the observed values and expec- 
tation in units of standard deviations, this yields a very 
simple rule if we assume that possible MPD values are 
normally distributed: any NRI value larger than 1.96 or 
smaller than —1.96 is significant. Unfortunately, the dis- 
tribution of MPD values is often skewed, such that this 
simple rule will lead to incorrect P value estimates [6,7] . 
Of particular concern, this skewness introduces a bias 
towards detecting either significant clustering or signifi- 
cant overdispersion [8]. If the distribution of MPD values 
for a particular tree can be reasonably approximated using 
a skew-normal distribution, calculating the skewness ana- 
lytically would enable us to remove this bias and improve 
the accuracy of P value estimates. In the last part of the 
paper, we describe experiments on large randomly gen- 
erated trees, supporting this argument. Further, when a 
large sample of trees should be considered, the full distri- 
bution of MPD values can be considered as a mixture of 
skew-normal distributions [9,10], greatly simplifying and 
speeding up the process of calculating P values across the 
entire set of trees. 

However, so far there has been no result in the related 
literature that shows how to compute the needed skew- 
ness measure efficiently. Hence, given a phylogenetic tree 



T and an integer r there is the need to design an efficient 
and exact algorithm that can compute the skewness of the 
MPD for r species in T. This would provide the last criti- 
cal piece required for the adoption of a fully analytical and 
efficient approach for analysing ecological communities 
using the MPD and the NRI. 

Our results 

In the present work we show how we can compute effi- 
ciently the skewness of the MPD. More specifically, given 
a tree T that consists of n edges and a positive integer 
r, we prove that we can compute the skewness of of the 
MPD over all subsets of r leaf nodes in T optimally, in 
0 (n) time. For the calculation of this skewness value we 
consider that every subset of exactly r species in T is 
picked uniform with probability out of all possible sub- 
sets that have r species. The main contribution of this 
paper is a constructive proof that leads straightforwardly 
to an algorithm that computes the skewness of the MPD 
in ®(n) time. This is clearly optimal, and it outperforms 
even the best algorithms that are known so far for comput- 
ing lower-order statistics for other phylogenetic measures; 
for example the best known algorithm for computing the 
variance of the popular Phylogenetic Diversity (PD) runs 
in 0(n 2 ) time [2]. 

More than that, we prove how we can compute in @ (n) 
time several quantities that are related with groups of 
paths in the given tree; these quantities can be possi- 
bly used as building blocks for computing efficiently the 
skewness (and other statistical moments) of phylogenetic 
measures that are similar to the MPD. Such an example 
is the measure which is the equivalent of the MPD for 
computing the distance between two subsets of species in 

r [in. 

The rest of this paper is, almost in its entirety, an elab- 
orate proof for computing the skewness of the MPD on 
a tree T in G(n) time. In the next section we define the 
problem that we want to tackle, and we present a group 
of quantities that we use as building blocks for computing 
the skewness of the MPD. We prove that all of these quan- 
tities can be computed in linear time with respect to the 
size of the input tree. Then, we provide the main proof of 
this paper; there we show how we can express the value of 
the skewness of the MPD in terms of the quantities that 
we introduced earlier. The proof implies a straightforward 
linear time algorithm for the computation of the skew- 
ness as well. In the last section we provide experimental 
results that indicate that computing the skewness of the 
MPD can be a useful tool for improving the estimation 
of P values when a skew-normal distibution is assumed. 
There we describe experiments that we conducted on 
large randomly generated trees to compare two different 
methods for estimating P values; one method is based on 
random sampling of a large number of tip sets, and the 
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other method relies in calculating the mean, variance, and 
skewness of the MPD for the given tree. 

Description of the problem and basic concepts 

Definitions and notation Let T be a phylogenetic tree, 
and let E be the set of its edges. We denote the number of 
the edges in T by n, that is n = \E\. For an edge e e E, 
we use w e to indicate the weight of this edge. We use S to 
denote the set of the leaf nodes of T. We call these nodes 
the tips of the tree, and we use s to denote the number of 
these nodes. 

Since a phylogenetic tree is a rooted tree, for any edge 
e e E we distinguish the two nodes adjacent to e into a 
parent node and a child node; among these two, the par- 
ent node of e is the one for which the simple path from 
this node to the root does not contain e. We use Ch(e) to 
indicate the set of edges whose parent node is the child 
node of e, which of course implies that e £ Ch(e). We indi- 
cate the edge whose child node is the parent node of e by 
parent (e). For any edge e e E, tree Tie) is the subtree of 
T whose root is the child node of edge e. We denote the 
set of tips that appear in Tie) as Sie), and we denote the 
number of these tips by 5(e). 

Given any edge e e E, we partition the edges of T into 
three subsets. The first subset consists of all the edges that 
appear in the subtree of e. We denote this set by Off(e). 
The second subset consists of all edges e' e E for which 
e appears in the subtree of e f . We use Anc(e) to indicate 
this subset. For the rest of this paper, we define that e e 
Anc(e), and that e £ Off(e). The third subset contains all 
the tree edges that do not appear neither in Off(e), nor in 
Anc(e); we indicate this subset by Ind(e). 

For any two tips u,v £ S, we use piu, v) to indicate the 
simple path in T between these nodes. Of course, the path 
piu, v) is unique since T is a tree. We use costiu, v) to 
denote the cost of this path, that is the sum of the weights 
of all the edges that appear on the path. Let u be a tip in S 
and let e be an edge in E. We use costiu, e) to represent the 
cost of the shortest simple path between u and the child 
node of e. Therefore, if u e Sie) this path does not include 
e, otherwise it does. For any subset R c S of the tips of 
the tree T, we denote the set of all pairs of elements in R, 
that is the set of all combinations that consist of two dis- 
tinct tips in R, by A iR). Given a phylogenetic tree T and a 
subset of its tips R c S, we denote the Mean Pairwise Dis- 
tance of R in T by MPD(T,R). Let r = \R\. This measure 
is equal to: 



MPD(T,£) 



r(r-l) 



costiu, v) . 



u,v}eA(R) 



Aggregating the costs of paths 

Let T be a phylogenetic tree that consists of n edges and s 
tips, and let r be a positive integer such that r < s. We use 



sk(T, r) to denote the skewness of the MPD on T when 
we pick a subset of r tips of this tree with uniform proba- 
bility. In the rest of this paper we describe in detail how we 
can compute sk(T, r) in Oin) time, by scanning T only a 
constant number of times. Based on the formal definition 
of skewness, the value of sk(T, r) is equal to: 



sk(T, r) = E ReSu b(S,r: 



'MPD(T,£) - expec(T,r) N 
var(T, r) 



E R eSub(S,r) [MPD 3 (T,£)] - 3 • var(T, r) 2 - expec(T, r) 3 



var(T, r) 3 



(i) 



where expec(T, r) and var(T, r) are the expectation and 
the variance of the MPD for subsets of exactly r tips in T, 
and E Re sub(S,r)[ •] denotes the function of the expectation 
over all subsets of exactly r tips in S. In a previous paper, 
we showed how we can compute the expectation and the 
variance of the MPD on T in 0(n) time [2]. Therefore, 
in the rest of this work we focus on analysing the value 
EReSub(S,r) [ MPD 3 (T, R)] and expressing this quantity in a 
way that can be computed efficiently, in linear time with 
respect to the size of T. 

To make things more simple, we break the description 
of our approach into two parts; in the first part, we define 
several quantities that come from adding and multiply- 
ing the costs of specific subsets of paths between tips of 
the tree. We also present how we can compute all these 
quantities in 0(n) time in total by scanning T a constant 
number of times. Then, in the next section, we show how 
we can express the skewness of the MPD on T based on 
these quantities, and hence compute the skewness in 0(n) 
time as well. Next we provide the quantities that we want 
to consider in our analysis; these quantities are described 
in Table 1. In this table but also in the rest of this work, 
for any tip u e S, we consider that SQ(w) = SQ(e), and 
TCiu) = TC(e), such that e is the edge whose child node 
is u. 

We provide now the following lemma. 

Lemma L Given a phylogenetic tree T that consists ofn 
edges, we can compute all the quantities that are presented 
in Table 1 in Oin) time in total 

Proof Each of the quantities (I)-(X) in Table 1 can be 
computed by scanning a constant number of times the 
input tree T, either bottom-up or top-to-bottom. For 
computing quantity (XI) we follow a more involved divide- 
and-conquer approach. 

We showed in a previous paper how we can compute 
quantity (I) and the quantities in (III) for all e e E in Oin) 
time in total [2]. 
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Table 1 The quantities that we use for expressing the skewness of the MPD 



WTC(1~\ — \^ rnsfdi \A 

\) ) — 7 LUbLyU/V) 


\\) V-D\ 1 ) — 7 LUjL \U i V) 


[u,v}eA(S) 


{u,v}eA(S) 


ill) ve G t, IL(e) = 7. C0Sl(U,V) 


IV) ve g t, bU(e) = >^ cost (u, v) 


{uy}eA(S) 
eep(u,v) 


{u,v}eA(S) 
eep(u,v) 


V) Ve g E, Mult(e) = J] TC(u) • TC(v) 


VI) Vu g S, SM(u) = J2 cost ^' ^ ' TC ^) 


{u,v}eA(S) 
eep(u,v) 


veS\{u} 


VII)Vee£ / TC SU b(e)= J] cost(u,e) 


VIII) Ve g E, SQsub(e) = Y cost2 ( u > e ) 


ueS(e) 


ueS(e) 


IX) Ve g E, PC(e) = J]co5t(u ; e) 


X) Ve g E, PSQ(e) = J]cost 2 (Li,e) 


ueS 


ugS 


XI) Ve g E, QD(e) = ^ I J] cost(u f v) I 




ueS(e) \vGS(e)\{u} / 





For an edge e e E, the quantity in (VII) can be written 

as: 

TCsubO) = Y cost ( u > e ) = Y w l • 

wGS(e) /GOff(e) 

We can compute this quantity for every e e E in linear 
time as follows; in the first scan we compute for every edge 
e the number of leaves s(e) in Tie). This can be done in 
Oin) time by computing in a bottom-up manner sie) as 
the sum of the numbers of tips sie'), Ve' e Ch(e). Then, 
we can compute TC su b ie) by scanning bottom-up the tree 
using the following formula: 

TC sub ie) = Y wi • 5(/) + TC sub (0 • 
/eCh(e) 

For quantity (VIII), for any e e Ewe have that: 

s Qsub( e ) = Y cosi;2 ( u > e ) 

ueS(e) 

= Y Wl Y 2 ' w k-sik)+ Y w r s iO 

/eOff(e) £eOff(/) /eOff(e) 

= 2-w / -TC su b(0 + ^.5(/). 

/eOff(e) 

Then SQ su b(e) can be computed for every edge e e E by 
scanning T bottom up and evaluating the formula: 

SQsubW^ Y 2-w / -TC su b(0+^.5(/)+SQsub(0. 

leCh(e) 

For every edge e in T, quantity (IV) can be written as: 

cost 2 iu, v) = 2 ^ wi • • NumPath(e, /, /c) 



NumPath(e, /) is equal to the number of simple paths that 
connect two tips in T and which also contain both edges 
e and /. Therefore, for any e e Ewe have: 

cost 2 iu, v) = 2(s — sie)) ^ w\ Y^ w k - s(k) 
/eOff(e) £eOff(/) 



{u,v}eA(S) 
eep(u,v) 



l,keE 



+ NumPath(e, /) . 



leE 



In the last expression, value NumPath(e, /, k) is equal to 
the number of simple paths that connect two tips in T and 
which also contain all three edges e, / and k. The quantity 



{u,v}eA(S) 
eep(u,v) 



+ 2 Y m(s-s(i)) Y w k' s i k ) 

/eAnc(e) keOf£(e) 

+ 2 -sie) Y Wiis-sil)) Y w k 



/eAnc(e) 



£eAnc(e) 
keO&(t) 



+ 2 -5(e) J] w/ X w k -sik) 
/elnd(e) *eOff(/) 

+ 2 X iv/. 5(/) X ^-5(/c) 
/elnd(e) £eOff(e) 

+ 2 -5(e) Y Wl Y w k' s i k ) 
/eAnc(e) £elnd(7) 

+ (5 - 5(e)) Y w2 r s i + s i e ) 

/eOff(e) 

Y w 2 r is-sil))+sie) Y w r s i l ) 

/eAnc(e) /elnd(e) 

= (5-5fe))-SQ sub (e) 
+ £ vv,(s-s(/))(2-TC sub (e)+iv r s(e)) 



/eAnc(e) 



/ 



+ 2 -5(e) X W/(5-5(/)) 

/eAnc(e) 



E 



./:eAnc(e) 
VeOff(/) / 



+ s(e) Y w/(2.TC 8U b(/) 

/elnd(e) 

+ w/-5(/))+2. TCsub (e) X w l' s ® 

/elnd(e) 

+ 2 -5(e) X w/ Y w k'Sik). 

/eAnc(e) Arelnd(/) 

(2) 
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We explain now how we can compute the six quanti- 
ties in (2) in 0(n) time, assuming that we have already 
computed TC su b(e) and s(e) for every e e E. To make 
the description simpler, we show in detail how we can 
compute the second and fourth quantities that appear in 
the last expression; it is easy to show that the rest of the 
quantities in (2) can be calculated in a similar manner. 

For any e e E, we denote the second quantity as follows: 

SUMi(e)= wi(s-s(l))(2.TC suh (e) + w r s(e)). 

/eAnc(e) 

We also define the following quantities: 
SUM M (e)= J2 w l( s ~ s ( l ))> 

/eAnc(e) 

and 

SUM 1B (e)= £ wf(s-s(l)). 

/GAnc(e) 

We can calculate SUMi (e) for every edge e by travers- 
ing the tree top-to-bottom and evaluating the following 
expressions: 

SUML4 0) = w e (s - s(e)) + SUMiA(parent(e)) . 

SUM 1B (e) = w 2 e (s- s(e)) + SUM 1B (parent (e)) . 

SUMi(e) = 2 • TC sub (£0 • SUM 1A (e) + SUM 1B (e) • s(e) . 



To compute the fourth quantity in (2), we use the fol- 
lowing quantity: 

SUM 2 (e)= "/(2.TC su b(/) + wr5(/)). 

/€Off(e) 

This quantity can be evaluated in 0(n) time for every e e 
E with a bottom-up scan of the tree. We also consider the 
following value which we can precompute in 0(n) time: 

SUM 2 (T) = J2 W *V ' TC sub(e) + w e • s(e)) . 

eeE 

For every edge e e E we calculate in a top-to-bottom 
manner the formula: 

SUM 3 (e) = w e (2-TC suh (e)+w e >s(e))+S\JM 3 (parent(e)). 

Then for each tree edge e, the fourth quantity in (2) can be 
computed in constant time as follows: 

s(e) J2 M//(2.TC su b(/) + M//-5(/)) 

/€lnd(e) 

= s(e) • (SUM 2 (T) - SUM 2 (e) - SUM 3 (e)) . 

The remaining quantities in (2) can be computed in a 
quite similar manner as the two quantities that we already 
described. 



Quantity (II) in Table 1 is equal to: 

CB(T) = cost3 ( u > v ) 

{u,v}eA(S) 

= ^ w e ^ cost 2 (u, v) — ^ w e • SQ(e) . 



eeE {u,v}eA(S) 
eep{u,v) 



eeE 



We have already presented how to compute SQ(e) for 
every edge e in T in 0(n) time in total, hence we can also 
compute CB(T) in 0(n) time by simply summing up the 
values w e • SQ(e) for every edge e in the tree. For quantity 
(V) it holds that: 

Mult(e) = TC ^) * TC ( V ) 

{u,v}eA(S) 
eep{u,v) 

ueSie) veS-S(e) 

= ( TC ^)| [E TC ^- E tc ^)|- 



<ueS(e) 



veS 



ueS(e) 



Since we have already computed TC(v) for every tip 
v e S, we can trivially evaluate J2veS TC(v) in 0(n) time. 
Hence, to compute quantity (V) it remains now to calcu- 
late the values SUM4(e) = J2 ue s( e ) TC(w) for every edge 
eeE. We can do this in 0(n) time as follows: at each 
tip u e S we store the value TC(u) that we have already 
computed. Then we scan T bottom-up and we calculate 
SUM4(e) by summing up the values SUM4(/) for all edges 
/eCh(e). 

Let u be a tip in S, and let e be the edge which is adjacent 
to u. Then, quantity (VI) is equal to: 

SM(u) = cost ^ v ) * TC ( V ) 

veS\{u} 

= E w i E TC ^ + E w i E TC ^ 

/GAnc(e) vg5\5(/) /€lnd(e) vg5(/) 

= E "/[E TC ^- E tc ^i 



/eAnc(e) 



v veS 



xeS(l) 



+ E^ E TC ^ - E ^ E TC ^ • 

/g£ vg5(/) /€Anc(e) veS(l) 

In the last expression, value J2 ve s TC(v) can be com- 
puted in 0(n) time, given that we have already computed 
TC(v) for every v e S. Value J2ieE w l J2veS(l) TC(v) and 
values J2xeS(l) TC(#) for any / e E can be calculated with 
a bottom-up scan of T in a similar way as we computed 
TC su b(e) for all eeE. The remaining sums that involve 
edges in Anc(e) can be computed in linear time for every 
edge e with a similar mechanism as with SUM 3 (e) that 



Tsirogiannis and Sandel Algorithms for Molecular Biology 2014, 9:1 5 
http://www.almob.Org/content/9/1/15 



Page 6 of 16 



we described earlier in this proof. For any edge e e E, 
quantities PC(e) and PSQ(e) in Table 1 are equal to: 

PC(e) = cost(u, e) = TC su b(e) + wi • s(l) 

ueS /€lnd(e) 

+ J2 w l(s~s(l)), 

/eAnc(e) 

and: 

PSQ(e) = J2 C0St2 ( u > e ^ = s Qsub(e) 

ueS 

+ 2 w l ■ TQ ub (/) + wj • s(/) 

telnd(e) 

+ 2 E Wl E Wk ' Sk 

1 \ 



+ 2 wi(s-si) 



/eAnc(e) 



E 



. keAnc(e) 



+wj{s-s(l)). 



J 



From the two last expressions, and given the description 
that we provided for other similar quantities in Table 1, 
it easy to conclude that PC(e) can be evaluated for every 
edge e in 0(n) time by scanning T a constant number of 
times. Having computed PC(e) for all edges e e £, the 
quantity PSQ(e) can be computed in a similar manner. 

Next we describe a divide-and-conquer approach for 
computing in 0 (ri) time quantity (XI) in Table 1 for every 
e g E . Before we start our description, we define one more 
quantity that will help us simplify the rest of this proof. 
For an edge e e E and a tip u e S(e) we define that TC e (u) 
is equal to: 

TC e (u) = ^ cost(u, v) . 

veS(e)\{u} 

For any edge e e E it is easy to show that: 



w€5(e) ueS(e) 



(3) 



Therefore, according to (3) we can compute the sum 
J2ueS(e) TC e (u) for all edges e e Em linear time in total, 
given that we have already computed TC(e) for every e e 
£, and TC(u) for every u e S. 

Next we continue our description for computing QD(e) 
using a divide-and-conquer approach. We start with the 
base case; for every edge tree e that is adjacent to a leaf 
node we have: 



For any edge e e E that is not adjacent to a leaf node, 
we can calculate QD(e) using the values of the respective 
quantities of the edges in Ch(e): 



QD(e)= Q D ^ 

leCh(e) 

+ 2 ^ ^ ^ cost(u,v) • TQO) 

/€Ch(e) ueS(l) veS(e)\S(l) 

+ E E E cost{u > v) • (4) 

leCh(e) ueS(l) \veS(e)\S(l) / 



The first sum in (4) can be computed in 0(|Ch(e) |) time 
for each edge e, given that we have already computed the 
values QD(/) for every / e Ch(e). We leave the description 
for calculating the second sum in (4) for the end of this 
proof. The third sum in this expression is equal to: 



E v 



cost(u, v) 



/€Ch(e) ueS(l) \veS(e)\S(l) 



= ^ ^ I ^ costiu, I) + costiy, I) 

/GCh(e) ueS(l) \veS(e)\S(l) ; 

= E E E E ^st 2 (uj) 

leCh(e) ueS(l) veS(e)\S(l) xeS(e)\S(l) 

+ cost(u, I) • cost(v, I) + cost(u, I) - cost(x, I) 

+ costiy, I) - cost(x, I) . 



(5) 



The first term of the sum in (5) can be expressed as: 



E E E E 

/GCh(e) ueS(l) veS(e)\S(l) xeS(e)\S(l) 

= E ^ ~ s ^ 2 ' cost2 ( u > o 

leCh(e) ueS(l) 

= We)-5(/)) 2 .SQ sub (/), (6) 

leCh(e) 



QD(e)= ^ ^ cos£(>,v) 

wGS(e) \vG.S(e)\M > 



0. 



and can be computed in 0(|Ch(e)|) time, given that we 
have already computed SQ su b(/), V/ e Ch(e). 
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The next two parts of the sum in (5) are equal to: We left for the end the description of the calculation of 

the second sum in (4). We can express this sum as follows: 

E E E E cost(u,D.cost(v,i) T T T ««*(«, v).tc,(«) 

leCh(e) ueSQ) veS(e)\S(l) xeS(e)\S(l) ^ ^ ^ 

/eCh(e) ueS(l) veS(e)\S(l) 

+ cost(u, I) • cost(x, I) 

= E E E 2.(s( C )-5(/))-cort(«,0.cart(v,0 = E E E (costi^D + costivJ^.Tdiu) 

leCHe) ueS(l) veS(e)\S(l) leCh ^ ueS ^ v eS(e)\S(l) 

(5(e) -5(/)) V cosf(K,/) w/(5(e) -5(/)) v y /w 

V W V " ^ I ' V " /eCh(c) M eS(Z) veS(e)\S(l) 



/eCh(e) weS(/) 



+ £ (w^ • 5(/c)) - w/ • 5(/) + E E E cost(v,l)-TCi{u) 

keCHe) /eCh(e) «eS(/) veS(e)\S(Z) 

+ - E TC sub(/0 - Tc sub (/) = (5(e) _ . C05t(Mj /) . jq(u) 

= 2 ^ - - Tc sub(o • ( w/(5(c) - 5(0) + E E E co ^ (v ' Z) • TC ^ u) • (9) 

teCh(e) V /eCh(e) «€5(/) veS(e)\S(Z) 



+ E ^-5(/c) -W/-5(/)+ £ TC sub (/c) 
\A:eCh(e) / keCh(e) 

- TC sub (/) I . 



We start with the second sum in (9). For this sum we get: 

E E E costal) ^d(u) 

/eCh(e) weS(/) veS(e)\S(Z) 



= E E U/(^)-5(/))+ W k'SV0 
(7) /eCh(e) ueS{l) \ \keCh{e) } 



The last expression can be computed in 0 ( | Ch(e) | ) time / \ \ 

as well, if we have already computed the sum J2keCh(e) w k' ~ w l ' 5 (0 + I E TC su b(70 I — TC su b(0 I • TC/(w). 

s(k) and the quantity TC su b(e) for every edge e in the tree. \keCh(e) / / 

We can rewrite the remaining term in (5) as: Because of (3)> ^ ^ expression can be written as . 

E E E E «»t(v, /)•«»«(*,*) E E L (s(e) _ s(0)+ t {Wk . s( k))- wr s(i) 

= E E E «»*(". 0 1 +( 52 TC sub (A:) I - TC sub (/) ] • TQ(«) 

/eCh(e) «eS(/) \veS(e)\S(/) / \*€Ch(e) / / 

= E s(/) -( E co ^ ; )| = E [»>l(s(e)-s(D)+ ( £ W *.s(*)|- W/ . S (i) 

/eCh(e) \veS(e)\S(/) / — < 



E woo - + E w * • s ^ - w i-s® 

feCh(e) \ keCh(e) 



leCh(e) \ \keCh(e) 

+ I TC sub«)-TC sub f/))[ TC(«)-TC(e)|, 



2 



which takes 8(|Ch(e)|) time to be computed for each 
_ edge e. 

+ E TC subW - TC sub (/) . (8) To compute the first sum in (9) efficiently, we need to 

keCh(e) / precompute for every edge / e E the following quantity: 

The last expression can be computed in @(|Ch(e)|) time cost(u,e) • TC e (u) . 

in a similar way as the previous terms of the sum in (5). ueS(e) 
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To do this, we follow again a divide-and-conquer 
approach. We get the base case for this computation for 
the edges of T that are adjacent to tips. For any such edge 
e we have: 

cost(u, e) • TC e (u) = 0 . 

ueS(e) 

For any other edge e e E we can compute this quantity 
based on the respective quantities of the edges in Ch(e). 
In particular, we have that: 

cost(u,e) • TC e (u) = ^ ^ cost (u, I) • TQ(w) 

ueS(e) /eCh(e) ueS(l) 

+ X w i X TC ^ u) + 

/eCh(e) weS(/) /eCh(e) 
cost(u, e) • cost(u, v) 

ueS(l) veS(e)\S(l) 

= ^ ^ cost(u,l)-TCi(u)+ y^ 

/eCh(e) weS(/) /eCh(e) 

xw/ ^ TC(m)-TC(/) )+ X! 

\weS(/) / /eCh(e) 

y^ y^ cost(u, e) • cost(u, v) . 

ueS(l) veS(e)\S(l) 

(10) 

The first two sums in the last expression can be com- 
puted in 0(|Ch(e)|) time, given that we have computed 
already for every / e Ch(e) the quantity TC(/) and the 
sum ^2 ue $(i) TC(w) (can be done with a single bottom-up 
scan of the tree). The last sum in (10) can be expressed 
as: 



y^ y^ y^ cost(u, e) • cost(u, v) 

/eCh(e) KeS(Z) v€S(e)\S(Z) 

= XI Wl X X C05 ^> v ) 

/eCh(e) ueS(l) veS(e)\S(l) 

+ ^ ^ ^ costiu, I) • cost(u, v) 

/eCh(e) mg5(/) veS(e)\S(/) 

= X W/ X X C05 ^ v ) 

/GCh(e) wgS(/) v€S(e)\S(/) 

+ X X X ^st 2 (u,i) 

/GCh(e) ueS(l) veS(e)\S(l) 

+ costiu, I) - costiy, I) . 

/€Ch(e) wg5(/) veS(e)\S(l) 



(ii) 



The two last sums in (11) are identical with the quanti- 
ties that we analysed in (6) and in (7). Finally, the first sum 
in (11) is equal to: 

X Wl X X cost ^ v ) 

/GCh(e) ueS(l) veS(e)\S(l) 

= m(s(e)-s(l))-TC suh (l) 

leCh(e) 

+ *r s(D - (s(e) - s(l)) 

/eCh(e) 

+ X w l\ s «A X s{k)-w k \-s 2 (l)-wi\ 

/GCh(e) \ \keCh(e) J J 

+ X Wl X T Csub(/0]-TC sub (/) j, 

leCh(e) \\keCh(e) / / 

(12) 

which can also be computed in @(|Ch(e)|) time. 

All the sums that we analysed from (4) up to (12) can be 
computed in 0(|Ch(e)|) time for every edge e in the tree. 
From this we conclude that for every edge e e E we can 
evaluate QD(e) in (4) in 0(|Ch(e)|) time from the respec- 
tive values of the edges in Ch(e). Since J2eeE |Ch(e) | = 
we prove that we can compute QD(e) for all the 
edges in T in © in). □ 

Computing the skewness of the MPD 

In the previous section we defined the problem of com- 
puting the skewness of the MPD for a given phylogenetic 
tree T. Given a positive integer r < s, we showed that 
to solve this problem efficiently it remains to find an effi- 
cient algorithm for computing £# G sub(.S,r)[MPD 3 (T,i?)]; 
this is the mean value of the cube of the MPD among 
all possible subsets of tips in T that consist of exactly r 
elements. To compute this efficiently, we introduced in 
Table 1 eleven different quantities which we want to use in 
order to express this mean value. In Lemma 1 we proved 
that these quantities can be computed in 0(n) time, where 
n is the size of T. 

Next we prove how we can calculate the value for the 
mean of the cube of the MPD based on the quantities 
in Table 1. In particular, in the proof of the following 
lemma we show how the value £# G sub(.s,r)[MPD 3 (T,^)] 
can be written analytically as an expression that con- 
tains the quantities in Table 1. This expression can then 
be straightforwardly evaluated in 0(n) time, given that 
we have already computed the aforementioned quantities. 
Because the full form of this expression is very long (it 
consists of a large number of terms), we have chosen not 
to include it in the definition of the following lemma. We 
chose to do so because we considered that including the 
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entire expression would not make this work more read- 
able. In any case, the full expression can be easily infered 
from the proof of the lemma. 

Lemma 2. For any given natural r < s, we can compute 
^€5^(5,r)[MPD 3 (T,^)] in 0(h) time. 



Proof. The expectation of the cube of the MPD is equal 



to: 



^GSub(5,r)[MPD 3 (T,i?)] 



r 5 (r — l) 6 



• EReSub(S,r) XI XI XI 

{u,v}eA(R) {x,y}eA(R) {c,d}eA(R) 

cost(u,v) • cost(x,y) • cost(c,d) 



From the last expression we get: 

EReSub(S,r) 



E E E 

u,v}eA(R) {x,y}eA(R) {c,d}eA(R) 

cost(u,v) • cost(x,y) • cost(c,d) 



= ^ ^ cost(u,v) - cost(x,y) 

{u,v}eA(S) {x,y}eA(S) {c,d}eA(S) 

• C0St(c,d) •jBR 6 Sub(5,r)[AP J R(M,V,^,y,C,rf)] > ( 13 ) 

where APr(u, v, x,y, c, d) is a random variable whose value 
is equal to one in the case that u, v, x, y,c,d e R, otherwise 
it is equal to zero. For any six tips u, v, x, y,c,d e S, which 
may not be all of them distinct, we use 0 (u, v, x, y, c, d) to 
denote the number of distinct elements among these tips. 
Let t be an integer, and let (t)k denote the /c-th falling fac- 
torial power of t, which means that = t(t — 1) . . . (t — 
k + 1). For the expectation of the random variables that 
appear in the last expression it holds that: 

£i?eSub(S,r) [APr(u, v, x,y, c, d)] = ^ e ( u > v > x >y> c > d ^ ( 14 ) 

\ s )0(u,v,x,y,c,d) 

Notice that in (14) we have 2 < 0 (u, v, x, y, c, d) < 6. The 
value of the function #(•) cannot be smaller than two in 
the above case because we have that u ^ v, x ^ y, and 
c ^ d. Thus, we can rewrite (13) as: 



^ ^ 4-^/ (c) , 

{u,v}eA(S) {x,y}eA(S) {c,d}eA(S) v w \WW> a ) 



E( r )9(u,v,x,y,c,d) , , x 
• C0St(U, V) 



• cost(x,y) - cost(c,d) 



(15) 



Hence, our goal now is to compute a sum whose ele- 
ments are the product of costs of triples of paths. Recall 
that for each of these paths, the end-nodes of the path 



are a pair of distinct tips in the tree. Although the end- 
nodes of each path are distinct, in a given triple the paths 
may share one or more end-nodes with each other. There- 
fore, the distinct tips in any triple of paths may vary from 
two up to six tips. Indeed, in (15) we get a sum where the 
triples of paths in the sum are partitioned in five groups; 
a triple of paths is assigned to a group depending on the 
number of distinct tips in this triple. In (15) the sum 
for each group of triples is multiplied by the same factor 
(r)e(u,v,x,y,c,d)/(s)e(u,v,x,y,c,d)> hence we have to calculate the 
sum for each group of triples separately. 

However, when we try to calculate the sum for each of 
these groups of triples we see that this calculation is more 
involved; some of these groups of triples are divided into 
smaller subgroups, depending on which end-nodes of the 
paths in each triple are the same. To explain this better, we 
can represent a triple of paths schematically as a graph; let 
{u, v}, {x,y}> {c, d} e A(S) be three pairs of tips in T. As 
mentioned already, the tips within each pair are distinct, 
but tips between different pairs can be the same. 

We represent the similarity between tips of these three 
pairs as a graph of six vertices. Each vertex in the graph 
corresponds to a tip of these three pairs. Also, there exists 
an edge in this graph between two vertices if the corre- 
sponding tips are the same. Thus, this graph is tripartite; 
no vertices that correspond to tips of the same pair can be 
connected to each other with an edge. Hence, we have a 
tripartite graph where each partite set of vertices consists 
of two vertices-see Figure 1 for an example. 

For any triple of pairs of tips {u, v}, {x,y}> {c, d] e A(S) 
we denote the tripartite graph that corresponds to this 
triple by G[u, v, x,y, c, d\. We call this graph the similarity 
graph of this triple. Based on the way that similarities may 
occur between tips in a triple of paths, we can partition 
the five groups of triples in (15) into smaller subgroups. 
Each of these subgroups contains triples whose similarity 
graphs are isomorphic. For a tripartite graph that con- 
sists of three partite sets of two vertices each, there can 
be eight different isomorphism classes. Therefore, the five 



a 



a (3 7 5 e C 



O— O ; O 

o i o-^-o 



7] 



Figure 1 Representing triples of paths as graphs, (a) A 

phylogenetic tree T and (b) an example of the tripartite graph 
induced by the triplet of its tip pairs {a, y), {8, y), {<?, 8), where 
{a, y, 8, 6} c S. The dashed lines in the graph distinguish the partite 
subsets of vertices; the vertices of each partite subset correspond to 
tips of the same pair. 
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groups of triples in (15) are partitioned into eight sub- 
groups. Figure 2 illustrates the eight isomorphism classes 
that exist for the specific kind of tripartite graphs that we 
consider. Since we refer to isomorphism classes, each of 
the graphs in Figure 2 represents the combinatorial struc- 
ture of the similarities between three pairs of tips, and it 
does not correspond to a particular planar embedding, or 
ordering of the tips. 

Let X be any isomorphism class that is illustrated in 
Figure 2. We denote the set of all triples of pairs in A (S) 
whose similarity graphs belong to this class by Bx. More 
formally, the set Bx can be defined as follows : 

B x = {{{u,v},{x,y},{c,d}} : {u J v} 1 {x J y} 1 {c J d} e A (5) 

and G[ u, v, x, y, c, d] belongs to class X in Figure 2} . 

We introduce also the following quantity: 

TRS(X) = ^2 cost(u, v) - cost (x,y)- cost (c, d) . 

{{u,v},{x,y},{c,d}}et3 x 

Hence, we can rewrite (15) as follows: 

(rh (rh 
— • TRS(A) + 3 • — ■ TRSCB) + 6 • — • TRS(C) 

(sh (s) 3 (s) 3 

+ 6 • — • TRSCD) + 3 • — • TRS(£) + 6 • — • TRS(F) 

(5)4 W4 W4 



+ 6 



GOs 



TRS(G) + 6 



(S)6 



TRS(H) 



(16) 



Notice that some of the terms & • TRS(X) in (16) are 
multiplied with an extra constant factor. This happens 
for the following reason; the sum in TRS(X) counts each 
triple once for every different combination of three pairs 
of tips. However, in the triple sum in (15) some triples 
appear more than once. For example, every triple that 
belongs in class B appears three times in (15), hence there 
is an extra factor three in front of TRSCB) in (16). 

To compute efficiently £^ € sub(5,r)[MPD 3 (T,^)], it 
remains to compute efficiently each value TRS(X) for 



every isomorphism class X that is presented in Figure 2. 
Next we show in detail how we can do that by expressing 
each quantity TRS(X) as a function of the quantities that 
appear in Table 1. 

For the triples that correspond to the isomorphism class 
A we have: 

TRS(A) = ™st 3 (u, v) = CB(T) . 

{u,v}eA(S) 

For TRS(£) we get: 

TRSCB) = ^ cost 2 (u,v)l ^ cost (u,x) 
{u,v}eA(S) \xeS\{u} 

+ ^ costiy, y) — 2 • cost(u, v) I 
yeS\{v} ) 

= ^ cost2 ( u > v ) (TC(w) + TC(v) - 2 • cost(u,v)) 

{u,v}eA(S) 

= J2 s QO) • TC( » - 2 • CB en • 

ueS 

The quantity TRS(C) is equal to: 

- ^2 cos t( u > v ) cost(u, x) • cost(x, v) 



ueS veS\{u} 



xeS\{u,v} 



= ^2 w e ^2 5^ C0St(u, X) • C0St(x, v) . 

eeE ueS(e) veS—S(e) xeS\{u,v} 

(17) 

For any e e Ewe have that: 

costtu, x) • cost(x, v) 

ueS(e) veS—S(e) xeS\{u,v} 

— ^2 ^2 cost(u, x) - cost(x, v) 

ueS(e) veS\{u} xeS\{u,v} 

— 2 ^2 ^2 cost ( u > x ) ' cos t(x> v) . (18) 

{u,v}eA(S(e)) xeS\{u,v} 



a 



0—0 



o 



o— o 




3 



d 



o 



o 



o 



e 0—0 

o— o 



o 
o 



o — o o 
o o — o 



O-i-O 

o i o 



o 
o 



o 
o 



o 
o 



o 
o 



Figure 2 Isomorphism classes. The eight isomorphism classes of a tripartite graph of 3 x 2 vertices that represent schematically the eight possible 
cases of similarities between tips that we can have when we consider three paths between pairs of tips in a tree T. 
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The first of the two sums in (18) can be written as: 

y^ y^ y^ cost(u, %) • cost(%, v) 

vgS\M *gS\{w,v} 

= ^ ^ ^ costiu, v) • cost(x, v) 

ueS(e) veS\{u} xeS\{u,v} 

= ^2 (cost(u,v) • TC(v) — cost 2 (u, v)) 



= J] SM(k)-SQ(k). 



(19) 



wG.S(e) 



According to Lemma 2, we can compute SM(u) and 
SQ(w) for all tips u e S in linear time with respect 
to the size of T. Given these values, we can compute 
2^weS(e) SM(h) — SQ(u) for every edge e e E in T with a 
single bottom-up scan of the tree. For any edge e in £, the 
second sum in (18) is equal to: 

y^ costiu, x) - cost(x, v) 

{u,v}eA(S(e)) xeS\{u,v} 

= ^ ^ cost(u, x) - cost(x, v) 

{u,v}eA(S(e)) xeS(e)\{u,v} 

+ ^ ^ cost(u, x) - cost(x, v) . (20) 

{w,v}€A(5(e))^G5\5(e) 

We can express the first sum in (20) as: 
y^ y^ cost(u, x) - cost(x, v) 

{u,v}eA(S(e)) xeS(e)\{u,v} 

ueS(e) \veS(e)\{u} J 

— - ^2 cos t 2 ( u > v ) 

ueS(e) veS(e)\{u} 

= ^QDw -\J2 X cost2 ( u > v > • w 

ueS(e) veS(e)\{u} 
The last sum in (21) is equal to: 

J2 J2 cost2 ( u > v ) = [ SQ( ^ ) ~ SQW • 

ueS{e) veS{e)\{u} \ueS(e) / 

(22) 

The value of the sum J2ueS(e) SQ( M ) can be computed 
for every edge e in 0 (ri) time in total as follows; for every 
tip u e S we store SQ(u) together with this tip, and then 
scan bottom-up the tree adding those values that are in 



the subtree of each edge. For the remaining part of (20) we 
get: 

y^ y^ cost(u, x) - cost(x, v) 

{u,v}eA{S{e)) xeS\S(e) 

= ^2 (cost(u, e) + cost(x, e)) 

{u,v}eA{S{e)) xeS\S(e) 

x (cost(v, e) + cost(x, e)) 
= ^2 cost(u, e) - costiy, e) 

{u,v}eA(S(e)) xeS\S(e) 

+ ^2 cost(x, e)-(cost(u, e)+cost(y, e)) 

{u,v}eA(S(e))xeS\S(e) 

+ ^ XI cost2 ( x > e )- ( 23 ) 

{u,v}eA(S(e)) xeS\S(e) 

The first sum in (23) is equal to: 

y^ y^ cost(u, e) - costiy, e) 

{u,v}eA(S(e))xeS\S(e) 

= \'{s- s(e)) (TC sub 2 (e) - SQ sub (e)) . (24) 
For the second sum in (23) we have: 
y^ y^ cost(x, e) • (cost(u, e) + cost(y, e)) 

{u,v}eA(S(e))xeS\S(e) 

= (s(e) - 1) • TC sub (e) ^ cost (^ e ) 

xeS\S(e) 

= (s(e) - 1) • TC sub (e) • (PC(e) - TC sub (e)) . (25) 
The last sum in (23) can be written as: 
y^ y^ cost 2 (x, e) 

{u,v}eA{S{e)) xeS\S(e) 

s(e)(s(e)-l) 



(PSQ(e)-SQ sub (e)) . 



(26) 



Combining the analyses that we did from (17) up to (26) 
we get: 

TRS(C) = J2 W * I SM(w) ~ QD(e) ~ SQ(e) 

eeE \ueS(e) 

-(s-s(e)) (TC sub 2 (e) - SQ sub (e)) 

- 2(s(e) - 1) • TC sub (e) • (PC(6) - TC sub (£0) 

- s(e) (s(e) - 1) • (PSQ(e) - SQ sub (e)) I . 
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The value of TRS(X>) can be expressed as: 

cost(u, v) • cost(u,x) - cost(u,y) 

ueS v,x,yeS\{u} 
v,x,yare distinct 

= - I J2 TC 3 (u) - 2 • TRS(A) - 3 • TRS(£) J 
6 \ues I 

= \'T TC 3 (u) + \ • CB(T) - \- T SQ(«) • TC(«) . 

WG.S WGo 

For TRS(£) we get: 

^ ^ C0St 2 (u, V) • C0St(x,y) 

{«,v}eA(S) {*,y}eA(S\{w,v}) 

= X cost 2 (u, v)(TC(T) - TC(m) - TC(v) + co5t(w, v)) 

{w,v}eA(S) 

= TC(T) XI w e ■ TC ( e ) - I] (SQ(k) • TC(«)) + CB(T) . 

We can rewrite TRS(F) as follows: 
y^ cost{u y v) 

{u,v}eA(S) 

(TC(u) • TC(v) — cost 2 (u, v) — X cost(u,x) • cost(x, v) I 
*eS\{w,v} / 

= X cos*(m, v) • TC(k) • TC(v) - CB(T) - 3 • TRS(C) 

{u,v}eA(S) 

= Y^ w e- Mult(e) - CB(T) - 3 • TRS(C) . 

eeE 

For the value of TRS(G) we have: 

TRS(G) = - ^ cost(u, v) ^ (cost(u,x) 

{u,v}eA(S) xeS\{u,v} 

+ cos£(v,*)) (TC(T) - TC(k) - TC(v) 

— TC(#) + cost(u, v) + cost(u,x) + cas£(v,#)) . 

(27) 

We now break the sum in (27) into five pieces and 
express each piece of this sum in terms of the quantities in 
Table 1. The first piece of the sum is equal to: 



The second piece that we take from the sum in (27) can 
be expressed as: 

- - ^ cost(u, v) ^ (cost(u,x) 

{u,v}eA(S) xeS\{u,v} 

+ cost(v,x)) (TC(k) + TC(v)) 
1 ^ 

= — 2^ cost(u, v) (TC(k) + TC(v) 

{u,v}eA(S) 

- 2 • cost(u, v)) (TC(u) + TC(v)) 
= - - v ) ( TC;2 

{w,v}gA(S) 

+ TC 2 (v) + 2-TC(w)-TC(v) 

- 2 • cos*(m, v) • (TC(k) + TC(v))) 

= - - Tc3 («) - XI c ^( v > *) * TC ( V ) • TC (*) 



- ^ cost(u, v) (cost(u,x)+cost(v,x))-TC(T) 

{u,v}eA(S) xeS\{u,v} 

= - • TC(T) I XI T C 2 (w) - 2 • J] co5/: 2 (w, v) J 
\wgS {w,v}gA(S) / 

= X - • TC(T) \TTQ 2 (u) - 2 • XX ' TC ^ ) • 

VmgS sg£ / 



ueS 



{v,x}eA(S) 



+ J] ™5£ 2 (y^)(TC0/) + TC(z)) 

{y^}€A(5) 

= -^E tc3(w) - 1>« • Mult(e) 



ueS 



eeE 



+ XI SQ(m) • TC(k) . 



ueS 



(28) 



The next piece that we select from (27) is equal to: 

— - ^2 cost(u,v) ^2 (cost(u,x) 

{u,v}eA(S) xeS\{u,v} 

+ cost(v,x)) - TC(x) 

= — - ^2 cost(u,v)(SM(u) 

{u,v}eA(S) 

+ SM(v) — cost(u, v) - TC(u) — cost(u, v) • TC(v)) 

= -^SM(m)-TC(«) 

ueS 

+ - XI cost^iu, v) (TC(a) + TC(v)) 

{w,v}gA(S) 

= - - X) sm (^) • tc (^) + - XI SQ( ^ • TC ^ • 



wg5 



wgS 



(29) 
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For the fourth piece of the sum in (27) we get: 



- ^2 cost 2 (u, v) (cost(u,x) + cost(v,x)) 

{u,v}eA(S) xeS\{u,v} 

= \ • TRS(£) = W SQ(«) • TC(«) - CB(T) . 



ueS 



(30) 



The last piece of the sum in (27) can be expressed as: 



- cost(u,v) ^2 (cost(u,x) + cost(v,x)) 2 

{u,v}eA(S) xeS\{u,v} 

= - ^2 cost(u,v) ^2 (cost 2 (u,x) 
{u,v}eA(S) xeS\{u,v} 

+ cost 2 (v,x)) +3 • TRS(C) 

1 ^ 

= - 2^ cost(u, v)(SQ(u) + SQ(v) 

{u,v}eA(S) 

-2- cost 2 (u, v))+3-TRS(C) 

= -J2 s Q (w) • TC(w) ~ CB(r) + 3 • TRS ( C ) • ( 31 ) 

ueS 

Combining our analyses from (27) up to (31) we get: 



We can express TRS(H) using the values of the other 
isomorphism classes: 



TRS(/f) = - [ J2 E E cost ( u > v > 

\{w,v}gA(5) {x,y}eA(S) {c,d}eA(S) 
• cost(x,y) - cost(c,d) — TRS(A) 

- 3 • TRSOB) - 6 • TRS(C) - 6 • TRS(£>) 

- 3 • TRS(£) - 6 • TRS(F) - 6 • TRS(G) ] 



- • TC 3 (T) - - • TRS(A) - - • TRS(£) 
6 6 2 



TRS(C) - TRSCD) - - • TRS(£) 
TRS(F) - TRS(G) . 



We get the value of E ReSu b(S,r) [ MPD 3 (T, R)] by plugging 
into (16) the values that we got for all eight isomorphism 
classes of triples. For any isomorphism class X we showed 
that the value TRS(X) can be computed by using the quan- 
tities in Table 1. The lemma follows from the fact that each 
quantity that appears in this table is used a constant num- 
ber of times for computing value TRS(X) for any class X, 
and since we showed that we can precompute all these 
quantities in © (n) time in total □ 

Theorem 3. Let T be a phylogenetic tree that contains s 
tips, and let rbea natural number with r < s. The skewness 
of the mean pairwise distance on T among all subsets of 
exactly r tips ofT can be computed in 0 (ri) time. 



TRS(G) = - • TC(T) • J]TC 2 (w) 

ueS 

- TC(T) • J2 w * ' TC ^ Tq3 ^ 

eeE ueS 

- J2 w e ' Mult(e) ~\'J2 SM(w) ' TC(w) 



eeE 



ueS 



+ -.£SQ(«)-TC(«) 



ueS 



- 2 • CB(T) + 3 • TRS(C) . 



Table 2 The sizes of tip samples that we considered for our 
experiments, together with the number of sets that we 
sampled for each tip size in order to derive the "true" 
values 



Size of each tip sample 


Number of sampled sets 


10 


10 5 


20 


10 5 


40 


5-10 4 


80 


3-10 4 


160 


2 • 10 4 


320 


10 4 
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MPD 40 



MPD 160 






Oe+00 2e+04 4e+04 6e+04 8e+04 1e+05 



10000 20000 30000 40000 50000 



Sample size 

Figure 3 Error in calculating the distribution that is used as point of reference. Error in lvalue estimation as the number of tree set resamples 
increases of tip set sizes of 10, 40 and 160. The dotted lines show errors of 0.05 MPD units, illustrating that the number of resamplings used here 
were sufficient to estimate percentiles to within 0.05 distance units in each case. 



Proof According to the definition of skewness, as it 
is also presented in (1), we need to prove that we can 
compute in 0(h) time the expectation and the vari- 
ance of the MPD, and the value of the expression 
£i?eSub(.S,r)[MPD 3 (T,#)]. In a previous paper we showed 
that the expectation and the variance of the MPD can be 
computed in 0 (n) time. By combining this with Lemma 2 
we get the proof of the theorem. □ 

Experiments: improved P value estimation 
incorporating skewness 

Earlier in this paper, we mentioned that distributions of 
MPD values are often found to be skewed, suggesting that 



it is necessary to incorporate this skewness into analyt- 
ical P value estimation. However, it is unclear whether 
good P value estimates are possible with only the first 
three moments of the distribution, or if more detailed 
distributional information is required. 

We investigate this question here by considering ran- 
dom phylogenetic trees produced by a pure birth pro- 
cess [12], though results were qualitatively identical when 
using trees generated by a combined birth-death process 
(and skewness did not vary as a function of the death 
rate). We took two approaches for estimating the posi- 
tion of the 2.5 and 97.5 percentile of MPD distribution 
given a particular tree instance. For any tree T that we 










100 








500 








1000 






















• 


• • • 


1 

10 


I 

20 


I 

40 


i i i 

80 160 320 



Tip set size Tip set size 

Figure 4 Comparison of approximation methods. Errors in P value approximation using different resampling replicates (indicated by the 
coloured lines), compared to that obtained by assuming a skew-normal distribution of MPD values (indicated as SN). Errors were strongly influenced 
by tip set size r, and weakly by tree size; on the left side appear the results for a 500 tip tree, and on the right for a 2000 tip tree). In most cases, P 
value approximation based on the skew-normal distribution performed better than the most commonly-used standard of 1 000 set resamplings 
(blue line), and the relative performance of the skew-normal approach improved with increasing tip set size. 
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constructed, we first calculated the distribution of the 
MPD values using as a point of reference extensive sam- 
pling of sets of tips (much more extensive than is usually 
employed in practice). In particular, for specific values of r 
we sampled from T a large number of sets that consist of 
exactly r tips (see Table 2 for the values of r and numbers 
of sets that we sampled). We simply calculated the per- 
centiles of these distributions, and call these the reference 
values, recognizing that they neveretheless contain some 
error, being incomplete samples from the tree. Complete 
sampling from large trees is computationally infeasible, 
but we estimate that the error in the calculated percentiles 
was less than 0.05 distance units in all cases (correspond- 
ing to an error of approximately 0.01% relative to the mean 
MPD-see Figure 3). 

The two approaches that we used to estimate the per- 
centile positions reflect two alternatives that might be 
employed by practising researchers. In the first approach, 
for each value r that we considered, we sampled again sev- 
eral sets of tips, yet much fewer than the ones we used 
to calculate the reference values (100, 500, 1000 or 5000 
sets). We then compare the absolute difference between 
percentiles estimated in this manner and the reference 
values. We refer to this difference as the error between the 
estimated percentile values and the reference values. The 
second approach uses the mean, variance and skewness of 
the MPD distribution to determine the position of the 2.5 
and 97.5 percentile of the skew-normal distribution with 
these moments [13]. The mean, variance and skewness 
were computed in this case based on all the MPD values 
that we used to calculate reference percentiles. Although 
we have implemented algorithms for computing the exact 
values of the mean and variance of the MPD, we have 
not implemented so far the algorithm that computes the 
skewness of the MPD; that is the algorithm outlined in 
the previous sections of this paper. As with the previ- 
ous approach, the error of this approximation method 
was calculated by taking the absolute difference between 
each estimated percentile position and the corresponding 
reference value. 

The experiment described above was repeated across 
100 replicate trees of each of two sizes (500 and 2000 
tips), and across a range of tip set sizes (10, 20, 40, 80, 
160 and 320). Errors were weakly related to tree size 
but decreased strongly with tip set size-see Figure 4. 
This decrease was more pronounced for estimates based 
on skew-normal approximation than resampling. Notably, 
the skew-normal approximation yielded smaller errors 
than the most commonly used standard of 1000 resam- 
plings for all but the smallest tip set sizes. 

Thus, we conclude that the errors introduced by assum- 
ing a skew-normal distribution of MPD values appear 
to be comparable to or smaller than those introduced 
by standard resampling procedures, while also showing 



better scaling with increased tip sample size. Finally, the 
computation of P values using skew-normal approxima- 
tion is typically faster than with resampling, particularly 
in cases involving large samples of trees. 

Conclusions 

Given a rooted tree T and a non-negative integer r, we 
proved that we can compute the skewness of the MPD 
among all subsets of r leaves in T in 0(n) time. An 
interesting problem for future research would be to imple- 
ment the algorithm that is outlined by our proof, and 
show its efficiency in practice. Also, it would be interest- 
ing to derive a similar result for the so-called Community 
Distance measure; this is the equivalent of the MPD when 
distances between two sets of species are considered [11]. 
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